Red Wine Quality Exploration By Ramon Prieto

Introduction

The database that I’ll be exploring consists of red variants of the Portuguese “Vinho Verde” wine. The variables being explored are shown below.

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Univariate Plots Section

Let’s first take a look at a summary of the red wine database variables to get a better idea of the data.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

The mean and median wine quality are close to the midpoint of the scale (quality is measured on a scale of 0 to 10, taken as the median of at least 3 evaluations made by wine experts).

## [1] "percentage of wines with quality 5 or 6"
## [1] 82.48906

The median and mean wine quality are therefore very representative of the entire dataset: more than 82 percent of the wines have a quality of 5 or 6.
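The percentage above can be reproduced with a one-line calculation. The sketch below is only an illustration: it uses the first ten quality scores printed in the str() output rather than the full 1599 rows, so its result differs from the 82.49 reported for the whole dataset. The variable names are my own.

```r
# First ten quality scores, copied from the str() output above
# (NOT the full dataset, so the result differs from 82.49).
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# mean() of a logical vector is the proportion of TRUE values,
# so this is the share of wines rated 5 or 6, as a percentage.
pct_mid <- mean(quality %in% c(5, 6)) * 100
print(pct_mid)
```

On the full data frame the same expression would be applied to the quality column.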

The distribution of alcohol in wine is interesting because we can see spikes at values ending in .5 or .0, which suggests that wineries tend to round their ABV values. I expected ABV and density to have an obvious relationship, but density seems to be more affected by the amount of sugar in the wine than by the volume of alcohol. As expected, the density of the wine is very close to the density of water: the average density of the wines in the database is 0.997 g/cm^3 and the density of water is 1 g/cm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

It is surprising to see that wineries measure pH with high precision. Since the great majority of wines fall within the 3-3.5 pH range, it will be interesting to investigate whether wine quality is affected by small variations in acidity or whether quality remains constant over a certain pH range. pH is approximately normally distributed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

pH is a measure of acidity, so it makes sense that the distributions for volatile acidity, fixed acidity, and pH are all similar. The distribution for volatile acidity appears to be bimodal, while fixed acidity and pH are approximately normally distributed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The distributions of fixed acidity and volatile acidity are similar to the pH distribution, as expected. In the citric acid plot we again see peaks at exact values (0, 0.25, 0.5) and an almost uniform distribution, which is surprising since I expected it to have a strong correlation with pH. This relationship might be worth exploring.

The distributions of sugar and salt (chlorides) suggest that there is a dependency between the two. Since both plots have a long positive tail, let’s apply a log transformation to them.
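As a rough sketch of the log transformation, the snippet below applies log10() to the first ten residual sugar and chloride values from the str() output. The base of the logarithm is an assumption on my part, since the report does not say which one was used, and the variable names are illustrative only.

```r
# First ten values of each variable, copied from the str() output above.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)
chlorides <- c(0.076, 0.098, 0.092, 0.075, 0.076,
               0.075, 0.069, 0.065, 0.073, 0.071)

# Base-10 log (an assumption). A log compresses the long right tail:
# the 6.1 outlier ends up much closer to the bulk of the sugar values.
log_sugar <- log10(residual.sugar)
log_chlorides <- log10(chlorides)

print(round(log_sugar, 3))
print(round(log_chlorides, 3))
```

When plotting with ggplot2, adding scale_x_log10() to a histogram is another way to get a similar view without creating new columns.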

After the transformation the possible relation between the two concentrations becomes more obvious. We see a lone low value on each and a similar distribution for values on the higher end. The balance between the two seems like a possible variable with a strong correlation to wine quality and will be explored in more detail later. Meanwhile, let’s see how the distributions of salt and sugar vary for wines of each quality.

From this group of plots it looks like the chloride (salt) and sugar content of a wine doesn’t greatly impact its quality. The concentrations have approximately the same distribution for all wines. However, since the amount of data on wines with quality outside the 5-7 range is limited, the patterns of their distributions are not fully formed.

## [1] "Free Sulfur Dioxide"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00
## [1] "Total Sulfur Dioxide"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
## [1] "Sulphates"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Nothing surprising here: SO2 concentrations in the wine are low, and the distributions of SO2 and sulphates are similar. A new variable, called SO2.ratio, is added to the “red” table to represent the amount of free SO2 relative to the total.
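A minimal sketch of how the SO2.ratio column can be derived is shown below. The data frame here holds only the first ten rows from the str() output, so it stands in for the full “red” table.

```r
# Stand-in for the "red" table: first ten rows from the str() output.
red <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Free SO2 as a fraction of total SO2 (a value between 0 and 1).
red$SO2.ratio <- red$free.sulfur.dioxide / red$total.sulfur.dioxide

print(round(red$SO2.ratio, 3))
```

Because free SO2 is part of total SO2, the ratio should never exceed 1, which makes it a quick sanity check on the data as well.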

Univariate Analysis

What is the structure of your dataset?

There are 1599 observations in the dataset with 12 variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free SO2, total SO2, density, pH, sulphates, alcohol, quality). Only the quality is an ordered factor variable with the following levels.

quality: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

However, there are no quality observations outside of the 3-8 range and more than 80 percent are inside the 5-6 levels.

Other observations
* The mean wine quality is 5.64 and the median is 6
* Most wines have a pH between 3 and 3.5
* There seems to be a dependency between concentrations of sugar and salt
* The average alcohol content is 10.42%

What is/are the main feature(s) of interest in your dataset?

The main feature in the dataset is quality. Although the quality observations are limited, I hope to determine which variables have the greatest impact on perceived quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Alcohol, pH, and citric acid are the most likely to influence the quality of the wine.

Did you create any new variables from existing variables in the dataset?

I created a ratio of free to total SO2 concentrations.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distributions for the concentrations of residual sugar and chlorides are long-tailed, so I performed a log transformation on them. The new distributions are close to normal. There is a clear relationship between salt and sugar concentrations.

Bivariate Plots and Analysis Section

To get a better idea of the relationship between the variables I’ll take a quick look at a correlation matrix. I’ll compare alcohol, density, pH, SO2 ratio, and quality to the other variables, as those are the ones that piqued my interest during the univariate analysis.

The correlation results are disappointing: there isn’t any strong correlation between the variables. However, from the matrix it looks like volatile.acidity, citric.acid, SO2.ratio, alcohol, sulphates, and density are the variables most strongly related to wine quality, which is the variable we are most interested in. Therefore, these variables will be analyzed in more detail.

I decided to explore all variables whose correlation with quality is greater than 0.2 in magnitude.
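The selection step can be sketched as follows. This is an illustration under stated assumptions: it uses only the first ten rows from the str() output (so the coefficients are not those of the full dataset), it uses R's default Pearson correlation, and it compares the absolute value against 0.2 so that negative correlations also qualify. The report does not state which of these choices were actually made.

```r
# Stand-in for the "red" table: first ten rows from the str() output.
red <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation (the default of cor()) of each variable with quality.
predictors <- setdiff(names(red), "quality")
cors <- sapply(red[predictors], function(x) cor(x, red$quality))
print(round(cors, 2))

# Keep variables whose correlation exceeds 0.2 in absolute value
# (the use of abs() is an assumption, see the note above).
strong <- names(cors)[abs(cors) > 0.2]
print(strong)
```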

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

As volatile acidity decreases, the quality of the wine increases. The relationship appears to level off once volatile acidity is around 0.4 g/L.
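Per-quality summaries in the format shown above are what by() prints when it is given summary as the function. The sketch below runs it on the first ten rows from the str() output only, so it covers just three quality levels and does not reproduce the numbers above.

```r
# Stand-in for the "red" table: first ten rows from the str() output.
red <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Six-number summary of volatile acidity within each quality level.
print(by(red$volatile.acidity, red$quality, summary))

# A more compact view: only the median per quality level.
med <- tapply(red$volatile.acidity, red$quality, median)
print(med)
```

Comparing the medians across levels is a quick way to check the direction of a trend before plotting it.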

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

This is a fun one. As alcohol content goes up, so does the perceived quality. I wonder if that could be biased in some obscure way. As expected, wine density is dependent on the alcohol content.

This is a well-known physical property and is of no interest in this study because density does not have a noticeable effect on wine quality, as seen below.

Therefore, density is not a factor in the alcohol/quality relationship.

There is not much going on here. Quality appears to increase as the concentration of sulphates in the wine increases, but the variation is too small to make anything of it. Sulphates have a negative correlation with volatile acidity, so it makes sense that more sulphates could result in better wine quality, as it would decrease volatile acidity.

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

Citric acid is known to add flavor and freshness to wines, and this plot shows that this is a good thing. Clearly an increase in citric acid in the wine tends to increase its quality. However, too much citric acid results in below average wines. The two wines with the highest citric acid concentration are below average quality (4 and 5). Counterintuitively, volatile acidity decreases as citric acid and fixed acidity increase. Therefore, quality should also increase with fixed acidity.

Variations in fixed acidity do not have the expected impact on wine quality. The relationship between citric acid, fixed acidity, and quality is something I want to investigate in more detail. Meanwhile, let’s get a better understanding of the differences between volatile and fixed acidity. Fixed acids are found naturally in grapes or are created through the normal fermentation process. Volatile acids are produced through fermentations carried out by spoilage organisms. The most common volatile acid is acetic acid, which is produced by bacteria as they ferment wine into vinegar! This explains their different relationships with wine quality.

Let’s see if there is an observable relationship between the concentration of sugar and salt.

The amount of salt in wine looks to be mostly constant except for a few wines with higher concentrations. I wonder if this has anything to do with wine quality.

This plot is misleading and not useful. It suggests that average wines tend to have larger salt or sugar concentrations. However, it only appears to be this way because we have many more datapoints at these quality values, which increases the possibility of outliers in our plot.

Multivariate Plots and Analysis Section

Since most of the wines fall within the 5-6 quality range, it makes sense to create three categories of wines so as to maximize the data points in each and have a better chance of detecting patterns. The categories I’ll use are below average (quality < 5), average (5 <= quality < 7), and above average (quality >= 7).
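One way to build these three categories is with cut(). The sketch below applies it to the six quality levels that occur in the dataset (3 through 8) so that every category is represented. The name rating is hypothetical and not taken from the original analysis.

```r
# The six quality levels that actually occur in the dataset.
quality <- 3:8

# With right = TRUE (the default) the intervals are
# (-Inf, 4], (4, 6] and (6, Inf], i.e. <5, 5-6 and >=7 for integer scores.
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("below average", "average", "above average"))
print(data.frame(quality, rating))

# Keeping only the non-average wines, which declutters later plots.
extremes <- quality[rating != "average"]
print(extremes)
```

Using -Inf and Inf as the outer breaks means the same call keeps working if scores outside the 3-8 range ever appear.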

The data points for average wines clutter the plots, so they will be removed.

Much better. This plot shows a clear distinction in quality: above average wines tend to have higher fixed acidity and citric acid concentrations. The opposite is true for volatile acidity.

It’s interesting that most of the wines with zero citric acid concentration are below average in quality. I at first thought that those values were probably empty cells or missing values but this strongly suggests otherwise. Citric acid and alcohol are the most useful variables if we were to try and predict the quality of the wine.

It’s clear that above average wines tend to have higher concentrations of citric acid and alcohol. Using this insight, let’s revisit the volatile acidity vs citric acid scatter plot, but let’s plot only wines with an alcohol content larger than 11.

Below average wines are mostly eliminated. Let’s see what happens if we increase the alcohol content to 12.

There is only one below average wine left! Further, we can see that most of the wines left have a citric acid concentration above 0.2. Clearly alcohol content can be used to filter out low quality wines.

I originally planned to build a model to determine wine quality based on its properties. However, I came to the conclusion that there was not enough data to make accurate and reliable predictions. What is possible is to set a few conditions that, if followed, will decrease the chances of selecting a below average wine. In this case my recommendation would be to buy wines with an alcohol content of 12% or above, a citric acid concentration higher than 0.2 g/L, and a volatile acidity lower than 0.4 g/L. This may not be able to predict the quality of the wine, but it comes close to assuring it won’t be below average.

Final Plots and Summary

Plot One

Description One

This plot is essential to understanding the limitations of our dataset. There is very little data on the lower and higher values of wine quality. Therefore, it is not plausible to develop a reliable prediction model, and we must be careful when exploring the data not to confuse patterns with lack of data. The distribution of wine quality is approximately normal.

Plot Two

Description Two

Alcohol content and citric acid are the two variables that have the most impact on wine quality. As either of the two variables increases, so does the average wine quality. The differences in mean alcohol content and mean citric acid concentration between the lowest and highest quality categories are 2.1% and 0.22 g/L respectively.

Plot Three

Description Three

## [1] "% Below Average"
## [1] 7.843137
## [1] "% Above Average"
## [1] 92.15686

This plot combines the three main variables. It clearly shows that wines with low volatile acidity and relatively high citric acid and alcohol concentrations tend to be of higher quality. In fact, only about 8 percent of wines that meet these parameters were of below average quality.
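A hedged sketch of how such percentages might be computed is given below. Several things here are assumptions: the helper pct_by_rating is hypothetical, its default thresholds are the ones recommended earlier (alcohol of 12% or more, citric acid above 0.2 g/L, volatile acidity below 0.4 g/L), and average wines are excluded because the two reported percentages add up to 100. The report does not state exactly how the figures were produced. The toy data frame is made up purely for illustration and is not taken from the dataset.

```r
# Hypothetical helper: share of below/above average wines among the
# non-average wines that pass the three filters.
pct_by_rating <- function(df, min_alcohol = 12, min_citric = 0.2,
                          max_volatile = 0.4) {
  keep <- df$alcohol >= min_alcohol &
    df$citric.acid > min_citric &
    df$volatile.acidity < max_volatile
  q <- df$quality[keep]
  q <- q[q < 5 | q >= 7]  # drop average wines (quality 5 or 6)
  c(below = mean(q < 5) * 100, above = mean(q >= 7) * 100)
}

# Made-up rows for illustration only (NOT real wines from the dataset).
toy <- data.frame(
  alcohol          = c(12.5, 12.0, 13.1, 12.8, 11.0, 12.2, 12.4),
  citric.acid      = c(0.40, 0.35, 0.25, 0.50, 0.45, 0.10, 0.30),
  volatile.acidity = c(0.30, 0.35, 0.38, 0.25, 0.30, 0.30, 0.30),
  quality          = c(7, 8, 4, 7, 8, 7, 6)
)

result <- pct_by_rating(toy)
print(result)
```

In the toy data, rows 5 and 6 fail a filter and row 7 is an average wine, so the percentages are based on the remaining four rows.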

Reflection

The red wine quality database contains information about the chemical properties of 1599 different red wines. I started my exploration of the data by getting a quick understanding of each individual property and then used that understanding to investigate their relationship with the quality of the wine and the other variables. I found the data to be lacking and decided against creating a predictive model for wine quality. Instead, I searched for the variables with the strongest correlation with wine quality to create parameters that could be used to minimize the probability of selecting a wine of below average quality.

There was a clear tendency for wine to increase in quality as alcohol and citric acid content increased and volatile acidity decreased. I was surprised to see that residual sugars didn’t have much of an effect on the wine.

The main issue with this database was its size. With a much larger wine dataset we could better establish patterns and discover new ones. Also, all of the data comes from “Vinho Verde” wines, which come from the north of Portugal and are generally young wines. It would be much more interesting to analyse data from wines all over the world and explore variables such as age, kind of grape, year of harvest, and region.